Skip to content

Added a SAMTOOLS_PIPELINE to run multiple samtools commands at once#4571

Closed
muffato wants to merge 36 commits intomasterfrom
samtools_pipeline
Closed

Added a SAMTOOLS_PIPELINE to run multiple samtools commands at once#4571
muffato wants to merge 36 commits intomasterfrom
samtools_pipeline

Conversation

@muffato
Copy link
Copy Markdown
Member

@muffato muffato commented Dec 11, 2023

I've recently spent some time optimising the resource usage of some Nextflow pipelines and I found that the samtools workflow to mark duplicates takes 5x longer to run and 3x more disk space when implemented the nf-core way, i.e. as 4 modules put together in a sub-workflow, compared to having a single local module with a bash pipeline.

Last time we discussed this sort of module, #3310 and https://nfcore.slack.com/archives/CJRH30T6V/p1682062426042809 , it was decided against though we didn't know the actual improvements. But both @drpatelh and @SPPearce suggested to reopen the discussion after seeing the figures. After all, the nf-core guidelines don't forbid it (bold type mine).

Software that can be piped together SHOULD be added to separate module files unless there is a run-time, storage advantage in implementing in this way.

However, @drpatelh said that there should be options to disable some of the 4 commands, because not anyone may need them. Here is my take on this.

I introduce a new module that can run a bash pipeline (hence the name) of any commands the user wants in any order, as long as they take BAM/CRAM as inputs as outputs. It wasn't as trivial as I thought it would because samtools commands have different ways of defining the input and output files 🙃

It works like this:

workflow {
    input = [
        [ id:'test', single_end:false ], // meta map
        file(params.test_data['sarscov2']['illumina']['test_paired_end_bam'], checkIfExists: true)
    ]
    commands = ['collate', 'fixmate', 'sort', 'markdup']
    SAMTOOLS_PIPELINE ( input, [[],[]], commands )
    SAMTOOLS_PIPELINE.out.output
}
process {
    withName: SAMTOOLS_PIPELINE {
        ext.args = [
            fixmate: '-m',
        ]
    }
}

What I like about the approach:

  • No mulled container involved. Just the regular samtools, which is much easier to understand and update.
  • Easy to add more samtools commands to the module.
  • Maximum flexibility for the user in terms of chaining samtools commands.

What I dislike about the implementation:

  • I couldn't find how to dynamically get the value of task.ext.args${index}. So instead I make ext.args a map where the key is the command name and the value is the string to add to the command-line. I don't know if/how closures can be used. I would very much prefer to use the regular task.ext.args, task.ext.args2, etc.
  • Because of the above, my implementation doesn't handle the same command being used multiple times in the pipeline with different command-line options.
  • The configuration of the module is actually split between the .nf (commands = ['collate', 'fixmate', 'sort', 'markdup']) and the .config (ext.args = [ ... ]). Maybe I should move the definition of the commands to the .config ?
  • Indentation of the resulting .command.sh is very fragile. What I've got works but for instance I tried pipeline_command += "samtools ${this_command} ${all_args[index]} \\ \n" and that caused the cat <<-END_VERSIONS > versions.yml` to be indented, which broke the generation of the yml.

What it needs:

  • testing with CRAM. I'm no sure I put the --reference in the right place.
  • migration to nf-test.

What it may need:

  • The selected commands often have command-line options to take extra files as inputs or output more files. Those could be added to the module, though to be fair the existing modules often don't list every single possible option either.
  • Some samtools-based sub-workflows could be rewritten to use this module, bringing an instant performance boost to their workflows.
  • A different name ? I chose "pipeline" because it's a bash pipeline under the hood. @mashehu doesn't like it :) Alternatives: pipe, chain, any, multi, combine. Ideally it should be name that stands out from the already-crowded list of samtools commands
  • A simple sub-workflow wrapper so that people looking for samtools sub-workflows still find about this. (they may not think about checking modules)
  • Alternatively, only expose this capability as a sub-workflow, either by allowing "local" modules in nf-core sub-workflows and moving the file over there, or by embedding this process block in the sub-workflow's main.nf.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
    • PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
    • PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware

@muffato muffato added new module Adding a new module WIP Work in progress labels Dec 11, 2023
@muffato muffato self-assigned this Dec 11, 2023
@maxulysse
Copy link
Copy Markdown
Member

Love the idea

Copy link
Copy Markdown
Contributor

@mashehu mashehu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't call it pipeline, that is already too overloaded.
I am also not 100% sure about the universality of this. Feels like we are basically moving away from atomic modules again and one needs to dig through several configs to read through several configs and docs to figure out what the module/command is actually doing

@muffato
Copy link
Copy Markdown
Member Author

muffato commented Dec 11, 2023

Hi guys. You were quick 😉 ! I am still testing a few things locally. I was then going to fill in the body of the PR to explain how this came up and share the link on Slack for discussion. I know we already discussed "samtools sormadup" and decided against, so I don't expect an easy ride 😅 but I think there are good reasons to still do something on this topic.

@SPPearce
Copy link
Copy Markdown
Contributor

As someone whose cluster is currently running out of scratch space, I definitely support the idea of chaining multiple samtools operations into one module, producing lots of individual bam/cram files uses a lot of memory when not necessary.
We already pipe bwa directly into samtools for this exact reason, bam/cram files can be BIG.

@muffato muffato marked this pull request as ready for review December 13, 2023 08:32
@muffato muffato requested review from a team and vagkaratzas December 13, 2023 08:32
@muffato muffato added Ready for Review and removed WIP Work in progress labels Dec 13, 2023
Comment thread modules/nf-core/samtools/pipeline/main.nf Outdated
@mahesh-panchal
Copy link
Copy Markdown
Member

mahesh-panchal commented Dec 13, 2023

I definitely approve of the idea.

In terms of implementation, I would skip the for loop and only use if then else to make an array, and then join the string array with the pipe symbol when evaluating the string array in it's execution place. I.e.
L94 would look like:

${samtools_commands.join(' | ')}

I'm starting to think we need to revisit also how args are passed to modules. I would prefer that the args are indexed with the 'tool_subtool' key in this case for readability.

I agree with the overloading of word pipeline too. I like the word compose but I'm not sure how to integrate it as I think the purpose of the composition also needs to be clear in the name as there could be quite a few compositions of a single tool after a time.

@muffato
Copy link
Copy Markdown
Member Author

muffato commented Dec 13, 2023

In terms of implementation, I would skip the for loop and only use if then else to make an array, and then join the string array with the pipe symbol when evaluating the string array in it's execution place. I.e.

I like the idea of join. Will definitely implement it. But I still need a for loop of some sort to go over the command names, right ? If not an actual for, then perhaps a .collect with a closure ?

@mahesh-panchal
Copy link
Copy Markdown
Member

I can't read properly on the phone ...
I'm not sure what to call these non-atomic modules. I think the name should reflect the composition, but perhaps following the subworkflow naming scheme works best here.

Comment thread modules/nf-core/samtools/pipeline/main.nf
@SPPearce
Copy link
Copy Markdown
Contributor

How does something very explicit, like samtools multicommand sound?
I don't think we want a lot of these type of modules, only very common ones like samtools (bedtools perhaps).

@mahesh-panchal
Copy link
Copy Markdown
Member

I think it would be better if we separated these aggregate commands in their own folder, like subworkflows are. It makes them easier to find, and highlights they're not atomic. Perhaps call the folder aggregates. I like the samtools_multicommand. I was thinking samtools_composepipe, or samtools_streamcommands, but I think they sound odd.

@muffato
Copy link
Copy Markdown
Member Author

muffato commented Dec 18, 2023

I think there's too much emphasis on the (non-)atomicity of this. The nf-core guidelines don't prohibit multiple commands running in a module. There is already a lot of modules that run multiple commands (the best proxy is to search for mulled containers: they typically indicate that the module needs another tool from another package).

In my mind, I map "modules" to Nextflow process and "sub-workflow" to Nextflow workflow, and I'm not keen on introducing another term for non-atomic modules, which are still process. Or I would argue to move all the existing non-atomic modules over there ;)

As an alternative, what about naming this: modules/nf-core/custom/multisamtools ? Still a "module" (because a process) but under the custom group because it's obviously customised.

@mahesh-panchal
Copy link
Copy Markdown
Member

mahesh-panchal commented Dec 18, 2023

While the guidelines don't prohibit multiple commands the reason is because the modules would be impractical without them. The modules are supposed to be as simple as possible without making them impractical. There may also be examples where this isn't the case, but there several modules that don't fully follow guidelines.

For findability, I do think this needs to be separated in some way. This kind of process definitely follows what the purpose of subworkflows are supposed to be.

As it stands, there are also several modules that are only used in a certain sequence and would benefit from being aggregated into a single process. This is part of a larger issue to me.

@SPPearce
Copy link
Copy Markdown
Contributor

I think this could slot into the subworkflows folder, rather than into a separate folder. I don't expect to be making too many of these chained modules like this.

@muffato
Copy link
Copy Markdown
Member Author

muffato commented Dec 19, 2023

Well, I did mention sub-workflows in my OP

  • A simple sub-workflow wrapper so that people looking for samtools sub-workflows still find about this. (they may not think about checking modules)
  • Alternatively, only expose this capability as a sub-workflow, either by allowing "local" modules in nf-core sub-workflows and moving the file over there, or by embedding this process block in the sub-workflow's main.nf.

The first technical question is: can nf-core sub-workflows have "local" modules ? I can't really see room for that in the repository

@mahesh-panchal
Copy link
Copy Markdown
Member

@nf-core/core we need a decision where to place this, because this is a necessity.

@adamrtalbot
Copy link
Copy Markdown
Contributor

I'm all in favour of more piping, I think we need to use it much much more to speed our pipelines up.

This is a pretty cool solution, but fairly complex? I'm not sure it will be very obvious to a new user how to give it a go. A number of modules that are defined by what they do is a simpler approach, fits more naturally with nf-core/modules and is more approachable to new users. e.g., let's imagine SAMTOOLS_BAMTOFASTQ process:

samtools collate -u -O $bam \
    | samtools fastq -1 ${prefix}.1.fastq.gz -2 ${prefix}.2.fastq.gz -0 /dev/null -s /dev/null -n

Clear, readable, quick to understand. In a pipeline it's obvious what it's doing:

SAMTOOLS_BAMTOFASTQ(bam)

This is the current solution, which is fine but not obvious to a new developer?

include { SAMTOOLS_PIPELINE as SAMTOOLS_BAMTOFASTQ }

SAMTOOLS_BAMTOFASTQ(bam, [], ['collate', 'fastq'])

@muffato
Copy link
Copy Markdown
Member Author

muffato commented Dec 19, 2023

@adamrtalbot . Then maybe #3310 could be reopened because it's exactly what you're advocating for ?

This PR is for a follow-up module that allows enabling/disabling the components of the command, without having to write a new module for every combination.

FYI my PR is currently only for BAM→BAM transformations. samtools fastq is not yet supported.

@mahesh-panchal
Copy link
Copy Markdown
Member

While this would be complex to a newcomer, they still have the option stick to the more traditional path.

I'm quite happy for processes like these to require their own documentation because the resource benefits are significant. In my opinion, there is no way to have best practice pipelines and also have everything newcomer friendly. We already do things that are not newcomer friendly like the way some parameters are separated into input: while the rest come in through another route. I know why it's there, but it still trips up every person I've taught.

These processes could also help serve to improve developer skills by demonstrating practical examples showing groovy tips and tricks, and better resource usage.

For pipeline readability, the developer can alias the name to something more inline what they intend to use the generalised process for. The rest of the complexity won't matter as long as it brings the benefits it was intended for.

@adamrtalbot
Copy link
Copy Markdown
Contributor

adamrtalbot commented Dec 20, 2023

@adamrtalbot . Then maybe #3310 could be reopened because it's exactly what you're advocating for ?

I think so. We've got a bit dogmatic about "1 tool, 1 module", to the point of losing sight of why that rule exists. In my mind, a module should be an single unit of work, not a piece of software. It's often the case that these two are the same but not always and sometimes nf-core pipelines are absurdly inefficient because everything runs in it's own module.

We already do things that are not newcomer friendly

Then let's make things easier to use and more friendly. Not more complicated because reasons.

These processes could also help serve to improve developer skills by demonstrating practical examples showing groovy tips and tricks, and better resource usage.

Nah, they'll just make a local module in their pipeline.

To be fair, I quite like this module, think it's really good code and love the flexibility. I'd happy accept this under the golden rule: Minimum Useful Thing™️, but I'm firmly in the camp we should be simplifying and reducing nf-core, not adding complexity.

@mahesh-panchal
Copy link
Copy Markdown
Member

We've got a bit dogmatic about "1 tool, 1 module", to the point of losing sight of why that rule exists. In my mind, a module should be an single unit of work, not a piece of software.

This would be a positive change, but is everyone in agreement (to generally allow multiple commands)?

These processes could also help serve to improve developer skills by demonstrating practical examples showing groovy tips and tricks, and better resource usage.

Nah, they'll just make a local module in their pipeline.

Then I would argue we're not annotating modules well enough and not making them findable. I think to help others, it would be useful if we added more comments to modules explaining anything that might be complicated.

To be fair, I quite like this module, think it's really good code and love the flexibility. I'd happy accept this under the golden rule: Minimum Useful Thing™️, but I'm firmly in the camp we should be simplifying and reducing nf-core, not adding complexity.

I think we're in agreement here to make things as simple as possible. I don't think anyone wants to add complexity unless it's a necessity.

@muffato
Copy link
Copy Markdown
Member Author

muffato commented Jan 10, 2024

FYI, there is already a module that runs samtools commands with a pipe: https://github.com/nf-core/modules/blob/master/modules/nf-core/samtools/collatefastq/main.nf
It does samtools collate | samtools fastq, which could very well be achieved with the two existing modules SAMTOOLS_COLLATE and SAMTOOLS_FASTQ, but is of course more efficient as a pipe in a single process

@mahesh-panchal
Copy link
Copy Markdown
Member

In my opinion that's one of the modules that doesn't follow guidelines and shouldn't be in the modules folder ( or more specifically should not have been approved - the placement is still up for discussion). However, generally, this too should have a place along side the process in this PR and the other optimized processes suggested.

@adamrtalbot
Copy link
Copy Markdown
Contributor

samtools fastq/fasta must take an input from samtools collate, otherwise it generate illogical but valid FASTQ file, i.e. a silent error, see the examples. Including a samtools fastq/fasta module is pretty dangerous because someone naive might install it without realising.

We shouldn't follow the guidelines if they cause people to create broken pipelines.

@mahesh-panchal
Copy link
Copy Markdown
Member

That's only for read pairs, that the input needs to be collated.

@mahesh-panchal
Copy link
Copy Markdown
Member

But I also hope none of our guidelines cause people to make broken pipelines. At the end of the day, it is up to the developer to check that their code is correct, hopefully through rigorous testing and peer-review.

@matthdsm
Copy link
Copy Markdown
Contributor

Can anybody add what was decided in the last maintainers meeting? I couldn't make it, but I'm very interested in either this or my sormadup proposal.

@mahesh-panchal
Copy link
Copy Markdown
Member

See nf-core/website#2327 for a basic summary and start on guidelines. @maxulysse is in charge of this. Basically this PR might not be maintainable by the average nf-core user (most maintainers were not in favor of it), and maintainers would like the script to be explicit with the tool, in particular so argsN maps to a particular tool/subcommand (i.e. like yours). However, they're also worried about maintenance burden, and since one can make local modules, these compound processes should likely go there unless they can be used across multiple pipelines. There's still space to provide more opinions though.

@matthdsm
Copy link
Copy Markdown
Contributor

Thanks @mahesh-panchal!
@maxulysse, In light of those decisions, should I put more work towards the sormadup module? I'm sure it can be used elsewhere, seeing there's an entire subworkflows that does the same, albeit slower and more resource intensive.
I think @priyanka-surana would be interesed too for the the sanger-tol pipelines.

@priyanka-surana
Copy link
Copy Markdown
Contributor

@matthdsm We (@muffato) are already using your implementation as a local module. Would be great to have this directly from nf-core.

@muffato
Copy link
Copy Markdown
Member Author

muffato commented Feb 29, 2024

Yes @matthdsm . In sanger-tol we've got sormadup already (almost exactly your version, the only change I did was removing the sort_memory), but also soon collate | fasta and view | fastq

@maxulysse
Copy link
Copy Markdown
Member

I'm in favor of resurrecting #3310, that sounds like a good candidate for a meta module that will be wonders in term of data footprint savings

@matthdsm matthdsm mentioned this pull request Feb 29, 2024
17 tasks
@SPPearce
Copy link
Copy Markdown
Contributor

I was a dissenting voice on this, I like the flexibility of this "module" and would merge this in.
But yes, modules chaining together multiple tools are acceptable, it was just the complexity of this one that is not currently in favour.

@muffato muffato closed this Mar 2, 2024
@muffato muffato deleted the samtools_pipeline branch March 2, 2024 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants